{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook is a demonstration for the \"Vietnamese NLP Continual Learning\" challenge. In the past, Underthesea has mainly focused on tuning models. With this project, we set ourselves a simple challenge: to build a continual learning NLP system."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The August 2021 Challenges: Vietnamese NLP Dataset for Continual Learning\n",
    "\n",
    "The August 2021 Challenges include [part-of-speech tagging](https://en.wikipedia.org/wiki/Part-of-speech_tagging) and [dependency parsing](https://universaldependencies.org/)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create environment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2\n",
    "\n",
    "# add the project folder (two levels above the working directory) to the import path\n",
    "import os\n",
    "import sys\n",
    "from os.path import dirname, join\n",
    "PROJECT_FOLDER = dirname(dirname(os.getcwd()))\n",
    "sys.path.append(PROJECT_FOLDER)\n",
    "\n",
    "# add dependencies\n",
    "from underthesea.utils.col_analyzer import UDAnalyzer, computeIDF\n",
    "from underthesea.utils.col_script import UDDataset\n",
    "from IPython.display import display, display_png\n",
    "from wordcloud import WordCloud\n",
    "from PIL import Image\n",
    "import matplotlib\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "# init folder\n",
    "DATASETS_FOLDER = join(PROJECT_FOLDER, \"datasets\")\n",
    "COL_FOLDER = join(DATASETS_FOLDER, \"UD_Vietnamese-COL\")\n",
    "raw_file = join(COL_FOLDER, \"corpus\", \"raw\", \"202108.txt\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Datasets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "# build a dataset from this month's raw text file\n",
    "raw_file = join(COL_FOLDER, \"corpus\", \"raw\", \"202108.txt\")\n",
    "generated_dataset = UDDataset.load_from_raw_file(raw_file)\n",
    "\n",
    "# load the existing UD-format file for the same month\n",
    "ud_file = join(COL_FOLDER, \"corpus\", \"ud\", \"202108.txt\")\n",
    "ud_dataset = UDDataset.load(ud_file)\n",
    "\n",
    "# merge the two datasets and write the result back to the UD file\n",
    "generated_dataset.merge(ud_dataset)\n",
    "dataset = generated_dataset\n",
    "\n",
    "dataset.write(ud_file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "analyzer = UDAnalyzer()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "analyzer.analyze_dataset_len(dataset)\n",
    "words_pos = analyzer.analyze_words_pos(dataset)\n",
    "# collect every word tagged 'CH' so punctuation can be filtered out of the word clouds later\n",
    "punctuations = set(words_pos[words_pos['pos'] == 'CH']['word'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sent_ids = analyzer.analyze_sent_ids(dataset)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "doc_sents = analyzer.analyze_doc_sent_freq(dataset)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = [item[1] for item in doc_sents]\n",
    "plt.hist(x, bins=40)\n",
    "plt.xticks(np.arange(min(x), max(x)+1, 3))\n",
    "plt.title(\"How many sentences were collected for each doc URL?\")\n",
    "plt.xlabel(\"Number of sentences\")\n",
    "plt.ylabel(\"Frequency\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Stopwords using IDF"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "doc_word_freqs = analyzer.get_doc_word_counters(dataset).values()\n",
    "idfs = computeIDF(doc_word_freqs)\n",
    "print(\"Words with lowest IDFs are candidates for Stopwords!\")\n",
    "stopwords_idf = {k: v for k, v in sorted(dict(idfs).items(), key=lambda x: x[1])[:40]}\n",
    "stopwords_idf"
   ]
  },
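  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The ranking above relies on `computeIDF`, whose exact formula is not shown in this notebook. As a minimal sketch, assuming the common definition `idf(w) = log(N / df(w))` (with `N` documents, `df(w)` of which contain `w`), the idea could look like the following. The name `sketch_idf` and the toy documents are hypothetical and only serve as an illustration.\n",
    "\n",
    "```python\n",
    "import math\n",
    "from collections import Counter\n",
    "\n",
    "\n",
    "def sketch_idf(doc_word_counters):\n",
    "    # Hypothetical helper: return {word: log(N / df)} for per-document Counters.\n",
    "    docs = list(doc_word_counters)\n",
    "    n_docs = len(docs)\n",
    "    doc_freq = Counter()\n",
    "    for counter in docs:\n",
    "        # count each word once per document, regardless of how often it occurs\n",
    "        doc_freq.update(set(counter))\n",
    "    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}\n",
    "\n",
    "\n",
    "# toy corpus of three documents (made-up data)\n",
    "docs = [\n",
    "    Counter(['và', 'là', 'mèo']),\n",
    "    Counter(['và', 'là', 'chó']),\n",
    "    Counter(['và', 'cá']),\n",
    "]\n",
    "idfs = sketch_idf(docs)\n",
    "\n",
    "# the words with the lowest IDF are the stopword candidates\n",
    "lowest = sorted(idfs.items(), key=lambda item: item[1])[:2]\n",
    "print(lowest)\n",
    "```\n",
    "\n",
    "Under this definition, a word that appears in every document gets an IDF of 0, which is why the lowest-IDF words are natural stopword candidates."
   ]
  },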
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Stopwords using Kullback-Leibler divergence"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from underthesea.datasets import stopwords\n",
    "\",\".join(sorted(stopwords.words))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Actionable Insights\n",
    "\n",
    "We want to explore:\n",
    "\n",
    "* What are the word frequencies?\n",
    "* What are the word frequencies today?\n",
    "* How many words are in this corpus?\n",
    "* What are the out-of-vocabulary words?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### What are the words?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "counter = analyzer.analyze_words(dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Remove some (potential) stopwords to get clearer Wordcloud"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "wordlist = [word for word in counter]\n",
    "for word in wordlist:\n",
    "    if word in stopwords_idf or word in punctuations:\n",
    "        del counter[word]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "w1 = WordCloud().generate_from_frequencies(counter)\n",
    "plt.figure(figsize=(16, 12), dpi=50)\n",
    "plt.imshow(w1, interpolation=\"bilinear\")\n",
    "plt.axis(\"off\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A word cloud of the most frequent words in this corpus."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### What are today's words?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "counter = analyzer.analyze_today_words(dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Remove some (potential) stopwords to get clearer Wordcloud"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "wordlist = [word for word in counter]\n",
    "for word in wordlist:\n",
    "    if word in stopwords_idf or word in punctuations:\n",
    "        del counter[word]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "w1 = WordCloud().generate_from_frequencies(counter)\n",
    "plt.figure(figsize=(16, 12), dpi=50)\n",
    "plt.imshow(w1, interpolation=\"bilinear\")\n",
    "plt.axis(\"off\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Trending News"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%javascript\n",
    "    // Load a script from the CDN and log its global once it has finished loading;\n",
    "    // logging right after appendChild would run before the script is available.\n",
    "    function loadScript(src, globalName) {\n",
    "        var script = document.createElement('script');\n",
    "        script.type = 'text/javascript';\n",
    "        script.src = src;\n",
    "        script.onload = function () { console.log(window[globalName]); };\n",
    "        document.head.appendChild(script);\n",
    "    }\n",
    "    loadScript('//cdnjs.cloudflare.com/ajax/libs/d3/7.0.1/d3.min.js', 'd3');\n",
    "    loadScript('//cdnjs.cloudflare.com/ajax/libs/jquery/3.6.0/jquery.min.js', '$');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from IPython.display import Javascript\n",
    "from ui import generate_svg_script\n",
    "svg_script = generate_svg_script(dataset.get_by_sent_id(\"1142\").get_ud_str())\n",
    "\n",
    "Javascript(svg_script)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Trending News Today"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## How to Contribute?\n",
    "\n",
    "It's great that you find this project interesting ❤️. Even the smallest contribution is appreciated. Welcome to this exciting journey with us.\n",
    "\n",
    "### You can contribute in so many ways!\n",
    "\n",
    "* [Create more useful open datasets](https://github.com/undertheseanlp/underthesea/tree/master/datasets/UD_Vietnamese-COL)\n",
    "* [Create more actionable insights](https://github.com/undertheseanlp/underthesea/tree/master/datasets/UD_Vietnamese-COL)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "underthesea_test",
   "language": "python",
   "name": "underthesea_test"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
